Many purported pseudogenes in bacterial genomes are bona fide genes

Background Microbial genomes are largely composed of protein-coding sequences, yet some genomes contain many pseudogenes caused by frameshifts or internal stop codons. These pseudogenes are believed to result from gene degradation during evolution but could also be technical artifacts of genome sequencing or assembly.

Results Using a combination of observational and experimental data, we show that many putative pseudogenes are attributable to errors that are incorporated into genomes during assembly. Within 126,564 publicly available genomes, we observed that nearly identical genomes often differed substantially in pseudogene counts. Causal inference implicated assembler, sequencing platform, and coverage as likely causative factors. Reassembly of genomes from raw reads confirmed that each variable affects the number of putative pseudogenes in an assembly. Furthermore, simulated sequencing reads corroborated our observations that the quality and quantity of raw data can significantly impact the number of pseudogenes in an assembler-dependent fashion. The number of unexpected pseudogenes due to internal stops was highly correlated (R² = 0.96) with average nucleotide identity to the ground truth genome, implying that relative pseudogene counts can be used as a proxy for overall assembly correctness. Applying our method to assemblies in RefSeq resulted in rejection of 3.6% of assemblies due to significantly elevated pseudogene counts. Reassembly from real reads obtained from high-coverage genomes showed considerable variability in spurious pseudogenes beyond that observed with simulated reads, reinforcing the finding that high coverage is necessary to mitigate assembly errors.

Conclusions Collectively, these results demonstrate that many pseudogenes in microbial genome assemblies are actually genes. Our results suggest that high read coverage is required for correct assembly and indicate that an inflated number of pseudogenes due to internal stops reflects poor overall assembly quality.

Supplementary Information The online version contains supplementary material available at 10.1186/s12864-024-10137-0.
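To make the "internal stop" class of putative pseudogene concrete, the following is a minimal sketch, not the authors' pipeline. It assumes a coding sequence given as a DNA string in frame, assumes the three stop codons of bacterial translation table 11, and uses hypothetical helper names (`internal_stop_positions`, `per_mbp`). It reports in-frame stops that occur before the final codon and normalizes a count to events per Mbp, the unit used in the supplementary figures.

```python
from typing import List

# Assumption: translation table 11 stop codons (same three as the standard code).
STOP_CODONS = {"TAA", "TAG", "TGA"}


def internal_stop_positions(cds: str) -> List[int]:
    """Return 0-based codon indices of in-frame stops before the final codon."""
    seq = cds.upper()
    n_codons = len(seq) // 3
    # The last codon is excluded because it is the expected terminator.
    return [
        i for i in range(n_codons - 1)
        if seq[3 * i:3 * i + 3] in STOP_CODONS
    ]


def per_mbp(count: int, genome_length_bp: int) -> float:
    """Normalize a count to events per million base pairs."""
    return count / (genome_length_bp / 1_000_000)


# Hypothetical toy sequences: the first has a premature TAA at codon index 1.
print(internal_stop_positions("ATGTAAGGGTGA"))
print(internal_stop_positions("ATGGGGTGA"))
print(per_mbp(10, 5_000_000))
```

A single-base substitution introduced during assembly is enough to create such a stop, which is why this count is sensitive to assembly errors.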


[Figure residue: S. aureus panel; axis: pseudogenes per Mbp; sequencing platforms: HiSeq 2500, HiSeq X, MiSeq v3]

Predicted models for the absolute difference in frameshifts (top) and internal stops (bottom) per Mbp from a source genome for assemblies generated from simulated single-end reads, as coverage varies against a fixed quality score of 34 for the short reads (left) and as quality score varies against a fixed coverage of 50-fold for the short reads (right).

Figure S6: Average Nucleotide Identity (ANI) correlations with assembly measures
ANI is shown versus extractable statistics about the assemblies generated from simulated E. coli reads. ANI versus the absolute difference between the relative frameshifts per Mbp of the generated assembly and the reference assembly (top left). ANI versus the absolute difference between the relative internal stops per Mbp of the generated assembly and the reference assembly (top right). ANI versus the total number of contigs in the generated assembly (middle left). ANI versus the contig N50 normalized to total nucleotides in the generated assembly (middle right). ANI versus the Kolmogorov-Smirnov test p-value when testing the distribution of CDS lengths for the generated assembly against the distribution of CDS lengths for the source assembly (bottom left). ANI versus the Kolmogorov-Smirnov test D-statistic when testing the distribution of CDS lengths for the generated assembly against the distribution of CDS lengths for the source assembly (bottom right). The coefficient of determination, Spearman's rho, and the p-value for the fitted slope are included in each panel.
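The Kolmogorov-Smirnov D-statistic used in the bottom panels is the largest vertical gap between two empirical cumulative distributions. The sketch below computes it in plain Python for two lists of CDS lengths; the function name `ks_d_statistic` and the example lengths are hypothetical and are not taken from the study. A library routine such as `scipy.stats.ks_2samp` would additionally return the p-value.

```python
from bisect import bisect_right


def ks_d_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov D: max gap between the empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    d = 0.0
    # The empirical CDFs only change at observed values, so checking
    # every distinct observation is sufficient to find the maximum gap.
    for x in set(a) | set(b):
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d


# Hypothetical CDS lengths (bp): a source genome and an assembly in which
# two genes have been truncated to shorter open reading frames.
source_lengths = [300, 900, 1200, 1500]
assembly_lengths = [300, 450, 450, 1500]
print(ks_d_statistic(source_lengths, assembly_lengths))
```

Spurious frameshifts and internal stops shorten predicted coding sequences, so a larger D between the assembly and its source is consistent with more assembly error.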

Table S1: Pseudogene distributions by submitter choices
The dominant submitter combination in all cases was SPAdes and a single set of Illumina reads. The five accompanying plotted minor distributions for each panel in Fig. S2 were compared against the major distribution for that species with the Kolmogorov-Smirnov test, recording the p-value for both frameshifts (FS) and internal stops (IS), along with the distribution counts.

Figure S1: Mapping disagreements for E. coli strain NR 51487
Figure S2: Pseudogene distributions within species
Figure S3: Fitted model coefficients for assemblies from simulated E. coli reads
Figure S4: Modeled behavior of assemblies from simulated reads by average quality score
Figure S5: Modeled behavior of assemblies from simulated single end reads
Figure S6: Average Nucleotide Identity (ANI) correlations with assembly measures
Table S1: Pseudogene distributions by submitter choices

Figure S3: Fitted model coefficients for assemblies from simulated E. coli reads
Coefficients for modeled intercept, coverage, and quality for combinations of simulated Illumina sequencing model and assembler. Each cell is divided diagonally, with the upper diagonal representing paired-end reads and the lower diagonal representing single-end reads. Coefficients with a Bonferroni-corrected p-value < 0.01 are marked with an asterisk (*) in the corresponding half of the cell. The logistic regression coefficients give the change in log odds of the number of pseudogenes per assembly given a unit increase in the fold coverage or Q-score.
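Because the coefficients are on the log-odds scale, exponentiating a coefficient gives the multiplicative change in odds per unit of coverage or Q-score. The following sketch illustrates that conversion; the helper names and the coefficient value of -0.1 are hypothetical and are not read from Figure S3.

```python
import math


def odds_ratio(coefficient: float, delta: float = 1.0) -> float:
    """Multiplicative change in odds for a `delta`-unit increase in a predictor."""
    return math.exp(coefficient * delta)


def shifted_probability(p0: float, coefficient: float, delta: float = 1.0) -> float:
    """Probability after moving `delta` units away from a baseline probability p0."""
    log_odds = math.log(p0 / (1.0 - p0)) + coefficient * delta
    return 1.0 / (1.0 + math.exp(-log_odds))


# Hypothetical coverage coefficient: each extra fold of coverage lowers the
# log odds by 0.1, so ten extra folds multiply the odds by exp(-1).
beta_coverage = -0.1
print(odds_ratio(beta_coverage, delta=10))
print(shifted_probability(0.5, beta_coverage, delta=10))
```

A negative coefficient therefore corresponds to an odds ratio below 1, that is, fewer expected pseudogenes as the predictor increases.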

Figure S4: Modeled behavior of assemblies from simulated reads by average quality score
Predicted models for the absolute difference in frameshifts (top) and internal stops (bottom) per Mbp from a source genome for assemblies generated from simulated reads as average quality scores vary for an array of sequencing platforms and assemblers. Results shown are at a fixed coverage of 50-fold.

Figure S5: Modeled behavior of assemblies from simulated single end reads

in Table S1. Inset legend names refer to the rows in the associated supplemental table.